3  Results

3.1 Summary of findings

In the next few subsections, we dive into the details of how we drew these conclusions.

3.2 Data pre-processing

To conduct deeper analysis on our data, we added the following columns to our dataset:

  1. BMI: the Body Mass Index of each person, a continuous variable calculated using the World Health Organization guideline: \(BMI = \frac{Weight}{Height^2}\).

  2. IsOverweight: a binary categorical value whose value is 1 if the BMI of the person is more than 25, 0 otherwise.

These columns serve as alternate metrics of the obesity of an individual, a continuous variable and a binary categorical variable, to complement NObeyesdad, the obesity category level which is a ordinal categorical variable.

These new columns can be found and used directly in the dataset scripts/xinyi-zhao_files/CleanObesityDataSet.csv.

In order to improve the visualization of certain plots, we also made local modifications to the data types of some columns for only that given plot. For instance, to plot the alluvial diagrams (figure 10), we rounded the values of the following: Frequency of consumption of vegetables (FCVC), Number of main meals (NCP), Physical activity frequency (FAF), Time using technology devices (TUE) (some of the values of these variables were floats, presumably because the researchers took an average of multiple time periods). This made sure that the variables are discrete (whole numbers).

3.3 Analysis of Height, Weight and BMI variables

Since the dataset contains artificially generated data, we first needed to verify that the dataset matches our expectations, notably by visualizing the distribution of these three continuous variables.

Code
# Importing the necessary libraries and dataset
library(readr)
library(dplyr)
library(tidyverse)
library(stats)
library(vcd)
library(ggplot2)
library(gridExtra)
library(grid)
library(GGally)
library(reshape2)
library(colorspace)
library(plyr)
library(ggalluvial) 
library(ggridges)

data <- read_csv('scripts/xinyi-zhao_files/CleanObesityDataSet.csv',show_col_types = FALSE)

3.3.1 Verification of number of rows, columns, missing values

A preliminary verifications of the number of rows, columns and missing values of our dataset show that there does not seem to be any problems with the structure of our dataset.

Code
cat("Number of rows in dataset: ", nrow(data), "\nNumber of columns in dataset: ", ncol(data),"\nNumber of missing values in dataset: ", sum(is.na(data)))
Number of rows in dataset:  2111 
Number of columns in dataset:  20 
Number of missing values in dataset:  0

3.3.2 Histogram

Additionally, the data seems to align with our expectations, visualizing the Height, Weight and BMI variables with a histogram (Fig. 1) shows that at first glance there is no anomalous data that stands out with respect to the people surveyed.

Code
# Histograms of Height, Weight and BMI

plot1 <- ggplot(data = data, aes(x = Height)) + geom_histogram(binwidth = 0.02)
plot2 <- ggplot(data = data, aes(x = Weight)) + geom_histogram(binwidth = 2)
plot3 <- ggplot(data = data, aes(x = BMI)) + geom_histogram(binwidth = 1)
grid.arrange(plot1, plot2, plot3, ncol = 3,
     top = textGrob("Histograms of the continuous variables Height, Weight and BMI",gp=gpar(fontsize=16)))

Fig. 1: Histograms of Height, Weight and BMI

3.3.3 Normality tests (Shapiro-Wilk test and QQ-plots)

We investigated whether three variables plotted in Fig. 1 follow a normal distribution by running the Shapiro–Wilk test (below) and plotting the QQ-plots of all three variables (Fig. 2).

Code
shapiro.test(data$Height)

    Shapiro-Wilk normality test

data:  data$Height
W = 0.99323, p-value = 2.772e-08
Code
shapiro.test(data$Weight)

    Shapiro-Wilk normality test

data:  data$Weight
W = 0.9765, p-value < 2.2e-16
Code
shapiro.test(data$BMI)

    Shapiro-Wilk normality test

data:  data$BMI
W = 0.97475, p-value < 2.2e-16
Code
plot1 <- ggplot(data, aes(sample = Height)) + 
  stat_qq() + 
  stat_qq_line(col = "navy") +
  xlab("Height")

plot2 <- ggplot(data, aes(sample = Weight)) + 
  stat_qq() + 
  stat_qq_line(col = "navy") +
  xlab("Weight")

plot3 <- ggplot(data, aes(sample = BMI)) + 
  stat_qq() + 
  stat_qq_line(col = "navy") +
  xlab("BMI")

grid.arrange(plot1, plot2, plot3, ncol = 3)

Fig. 2: QQ-Plot of Height, Weight and BMI

Analysis of normality:

On the histogram, it seems like Weight does not follow a normal distribution as there are multiple local peaks at around 75kg, 90kg and 100kg), and neither does BMI since there is nobody whose BMI is less than 18 or higher than 50, hence the BMI distribution does not have a tail.

  1. The height of the individuals does not appear to follow a normal distribution, as it appears to be multimodal (1.75m and 1.8m) and the Shapiro-Wilk normality test gives a p-value of 2e-8 << 0.05. This is not quite aligned with our expectations as height has been found by researchers to in general follow a normal distribution.

  2. The weight of the individuals appears to have a slight right skew according to the QQ Plot (the plot “climbs slowly” for smaller values of the x axis). This is aligned with our expectations as body weight is found to have a right skew according to research.

  3. The BMI of individuals appears to be in between a uniform distribution and a normal distribution from the QQ Plot (flat for small and big values on the theoretical normal quantiles, but not flat enough to be a true uniform distribution). This is not collaborated by other research that has found that the distribution of BMI is right-skewed, for instance by Briese et. al (2011). Yet, this is to be expected since the researchers artificially generated more high BMI data points to balance out the class imbalance.

3.3.4 Boxplot of BMI

Afterwards, we plotted a boxplot (Fig. 3) to better understand the other statistics (e.g. median, outliers) of the variable BMI.

Code
# Boxplot to visualize the BMI across all individuals 
ggplot(data, aes(x=BMI)) + 
  geom_boxplot(fill = "lightblue", color = "black") + 
  coord_cartesian(ylim = c(-2, 2)) +  
  labs(title = "Boxplot of BMI across all individuals")

Fig. 3: Boxplot to visualize the BMI across all individuals
Code
fivenum(data$BMI)
[1] 12.99868 24.32580 28.71909 36.01650 50.81175

The Tukey Five-Number summary tells us the following: - Minimum BMI is 13

  • Lower hinge is 24

  • Median is 28.7

  • Upper hinge is 36.0

  • Maximum BMI is 50.8

Since the median is at 28.7 > 25, more than half of the sample corresponds to overweight individuals (generated or real people). Since the lower hinge is at 24 (normal weight), we also know that less than 25% of the individuals in the dataset are underweight, and also less than 25% of the individuals in the dataset are of normal weight.

Therefore, this dataset has much more data of overweight people than expected, which makes sense since the obesity level category divides “overweight” into five different categories (Overweight I & II, Obesity I, II, III).

3.3.5 Ridgeline plot of weight

Lastly, we made a ridgeline plot of weight of individuals across Obesity Types (Fig. 4).

Code
ggplot(data, aes(x = Weight, y = NObeyesdad, fill = NObeyesdad)) +
  stat_density_ridges(alpha = 0.7) +
  scale_fill_viridis_d() +
  labs(title = "Ridgeline Plot of Obesity Types", x = "weight", y = "Obesity Type") +
  theme_ridges()

Fig. 4: Ridgeline Plot of weight of individuals across Obesity Types

3.4 Preliminary visualizations

Code
set.seed(123456)

data <- data %>%
 mutate_if(is.numeric, scale)

data_num <- data[c("Age", "Height", "Weight", "NCP", "CH2O", "FAF", "TUE","BMI")]

# Convert all columns to numeric
data_num <- sapply(data_num, as.numeric)

# Sample 1000 points for clearer visualization
data_sampled <- sample_n(as.data.frame(data_num), 300)

ggpairs(as.data.frame(data_sampled), aes(alpha = 0.4, size = 0.02), 
           upper = list(continuous = wrap("cor", size = 12))) + 
  theme(text = element_text(size = 35))

Fig. 5: Scatterplot of each pair of numeric variables in the dataset
Code
categories <- c('Insufficient_Weight','Normal_Weight', 'Overweight_Level_I', 'Overweight_Level_II', 'Obesity_Type_I', 'Obesity_Type_II','Obesity_Type_III')

# Create a mapping from category to integer
category_mapping <- setNames(1:length(categories), categories)

data$NObesity_numeric <- as.integer(factor(data$NObeyesdad, levels = categories))

data_pca <- data[c("FCVC","NCP","FAF","TUE","NObesity_numeric")]


cor_matrix <- cor(data_pca)

# Reshape the data for heatmap
melted_cor <- melt(cor_matrix)

# Creating heatmap
ggplot(melted_cor, aes(Var1, Var2, fill = value)) +
  geom_tile() +
  geom_text(aes(label = ifelse(value > 0.1&value<1, round(value, 2), '')), color= 'darkblue',vjust = 1) +
  scale_fill_gradient2(midpoint = 0) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Heatmap of Correlations")

Fig. 6: Heatmap of correlations between numerical variables
Code
# Creating individual density plots with smooth lines
plot_fcvc <- ggplot(data, aes(x = FCVC)) + 
  stat_density(aes(y = ..density..), geom = "area", fill="blue", alpha=0.5) +
  ggtitle("FCVC Density Plot") +
  xlab("FCVC") +
  ylab("Density")

plot_ncp <- ggplot(data, aes(x = NCP)) + 
  stat_density(aes(y = ..density..), geom = "area", fill="blue", alpha=0.5) +
  ggtitle("NCP Density Plot") +
  xlab("NCP") +
  ylab("Density")

plot_faf <- ggplot(data, aes(x = FAF)) + 
  stat_density(aes(y = ..density..), geom = "area", fill="blue", alpha=0.5) +
  ggtitle("FAF Density Plot") +
  xlab("FAF") +
  ylab("Density")

plot_tue <- ggplot(data, aes(x = TUE)) + 
  stat_density(aes(y = ..density..), geom = "area", fill="blue", alpha=0.5) +ggtitle("TUE Density Plot") +
  xlab("TUE") +
  ylab("Density")

# Arrange the plots in a 2x2 grid
grid.arrange(plot_fcvc, plot_ncp, plot_faf, plot_tue, nrow = 2)

Fig. 7: Density plots for Frequency of consumption of vegetables (FCVC), Number of main meals (NCP), Physical activity frequency (FAF), Time using technology devices (TUE)

3.5 Understanding the correlation between biological factors and obesity

3.6 Correlation between lifestyle habits and obesity

Code
# Create a mosaic plot for each variable in the table plotted against 'IsOverweight'

data_mosaic <- data[c("Gender", "family_history_with_overweight","FAVC","SMOKE","SCC","IsOverweight")]

data_mosaic$family_history_with_overweight <- fct_relevel(data_mosaic$family_history_with_overweight, "yes", "no")

data_mosaic$SMOKE <- fct_relevel(data_mosaic$SMOKE, "yes", "no")

data_mosaic$FAVC <- fct_relevel(data_mosaic$FAVC, "yes", "no")

data_mosaic$SCC <- fct_relevel(data_mosaic$SCC, "yes", "no")


data_mosaic$IsOverweight <- as.character(data_mosaic$IsOverweight)

for(var in names(data_mosaic)[names(data_mosaic) != "IsOverweight"]) {
# png(paste0("mosaic_", var, "_IsOverweight.png"))
 contingency_table <- table(data_mosaic[[var]],data_mosaic$IsOverweight)
 mosaicplot(contingency_table, main = paste("Mosaic Plot for", var), ylab = "Is Overweight", xlab = var, col = c("pink", "brown"))
# dev.off()
}

Fig. 8a: Mosaic Plot of Gender against IsOverweight

Fig. 8b: Mosaic Plot of Family History against IsOverweight

Fig. 8c: Mosaic Plot of Frequent consumption of high caloric food against IsOverweight

Fig. 8d: Mosaic Plot of weight of whether the person smokes against IsOverweight

Fig. 8e: Mosaic Plot of calories consumption monitoring against IsOverweight
Code
# plot 1 mosaic 3 var
library(colorspace)
library(plyr)

data_mosaic <- data[c("FAVC", "SCC", "IsOverweight")] 
data_mosaic <- dplyr::count_(data_mosaic,vars = c("FAVC", "SCC", "IsOverweight"))

colnames(data_mosaic)[colnames(data_mosaic) == "n"] <- "Freq"

data_mosaic$IsOverweight <- fct_rev(as.character(data_mosaic$IsOverweight))

mosaic(FAVC ~ SCC + IsOverweight, data = data_mosaic, direction = c("v","h","v"), highlighting_fill = c("#bdd7e7","#2171b5"))

Fig. 9: Trivariate mosaic plot visualizing frequent consumption of high caloric food, calories consumption monitoring and IsOverweight
Code
data_alluvial <- data[c("Age", "NObeyesdad", "NCP", "CH2O", "FAF", "CALC", "TUE")]

data_alluvial$CH2O <- as.numeric(data_alluvial$CH2O)
data_alluvial$CH2O <- round(data_alluvial$CH2O)
data_alluvial$NCP <- as.numeric(data_alluvial$NCP)
data_alluvial$NCP <- round(data_alluvial$NCP)
data_alluvial$FAF <- as.numeric(data_alluvial$FAF)
data_alluvial$FAF <- round(data_alluvial$FAF)
data_alluvial$TUE <- as.numeric(data_alluvial$TUE)
data_alluvial$TUE <- round(data_alluvial$TUE)

data_alluvial <- data_alluvial %>%
 mutate(Obesity_Category = case_when(
   NObeyesdad %in% c("Overweight_Level_I", "Overweight_Level_II") ~ "Overweight",
   NObeyesdad %in% c("Obesity_Type_I", "Obesity_Type_II") ~ "Obese",
   NObeyesdad == "Normal_Weight" ~ "Normal",
   NObeyesdad == "Insufficient_Weight" ~ "Underweight"
 ))
Code
data_alluvial_2 <- dplyr::count_(data_alluvial,vars = c("Obesity_Category", "FAF", "CALC", "NCP", "CH2O", "TUE"))

colnames(data_alluvial_2)[colnames(data_alluvial_2) == "n"] <- "Freq"

data_alluvial_2$Obesity_Category <- factor(data_alluvial_2$Obesity_Category, levels = c("Underweight", "Normal", "Overweight", "Obese"))

ggplot(as.data.frame(data_alluvial_2), aes(y = Freq, axis1 = Obesity_Category, axis2 = TUE, axis3 = FAF)) +
  geom_alluvium(aes(fill = Obesity_Category), width = 1/12) +
  geom_stratum(width = 1/12, fill = "grey80", color = "grey") +
  geom_label(stat = "stratum", aes(label = after_stat(stratum))) +
  scale_x_discrete(limits = c("Gender", "Dept"), expand = c(.05, .05)) +
  scale_fill_brewer(type = "qual", palette = "Set1")  + 
  expand_limits(x = 0, y = 0) +
  ggtitle("Obesity category vs Time on Electronics vs Alcohol consumption") + 
  theme_void() +
  theme(plot.margin = margin(0,0,0,0, "cm"))

Fig. 10a: Alluvial diagram of Obesity category vs Time on Electronics vs Alcohol consumption
Code
# CALC (consumption of alcohol), TUE (time on electronics), FAF (physical activity), CH2O (daily water), NCP (number of main meals)
ggplot(as.data.frame(data_alluvial_2), aes(y = Freq, axis1 = Obesity_Category, axis2 = NCP, axis3 = CH2O)) +
  geom_alluvium(aes(fill = Obesity_Category), width = 1/12) +
  geom_stratum(width = 1/12, fill = "grey80", color = "grey") +
  geom_label(stat = "stratum", aes(label = after_stat(stratum))) +
  scale_x_discrete(limits = c("Gender", "Dept"), expand = c(.05, .05)) +
  scale_fill_brewer(type = "qual", palette = "Set1")  + 
  expand_limits(x = 0, y = 0) +
  ggtitle("Obesity category vs Nb of meals vs Water consumption a day") + 
  theme_void() +
  theme(plot.margin = margin(0,0,0,0, "cm"))

Fig. 10b: Alluvial diagram of Obesity category vs Nb of meals vs Water consumption a day